Data mining identifies patterns in large datasets using machine learning
techniques and database systems. Clustering high dimensional data is
challenging because of the curse of dimensionality, and existing approaches
do not improve space complexity or data retrieval performance. To overcome
these limitations, a Spectral Clustering Based VP Tree Indexing Technique is
introduced. The technique clusters and indexes densely populated high
dimensional data points for effective data retrieval based on a user query.
A Normalized Spectral Clustering Algorithm groups similar high dimensional
data points. A Vantage Point Tree is then constructed to index the clustered
data points with minimum space complexity. Finally, the indexed data are
retrieved in response to a user query using the Vantage Point Tree based Data
Retrieval Algorithm. This helps to improve the true positive rate with minimum
retrieval time. Performance is measured in terms of space complexity, true
positive rate and data retrieval time on the El Nino weather data set from the
UCI Machine Learning Repository. Experimental results show that the proposed
technique reduces space complexity by 33% and data retrieval time by 24% when
compared to state-of-the-art works.

1. INTRODUCTION

Clustering is a major data mining task that groups similar high dimensional data, and it is used in a
large number of real applications such as weather forecasting, share trading, medical data analysis and
aerial data analysis. Data mining (DM) is used to extract useful information from large amounts of data.
High-dimensional data are widespread in several areas, including machine learning, signal and image
processing, computer vision, pattern recognition and bioinformatics. High dimensionality increases the
computational time and memory requirements, and it also significantly affects performance when the
number of samples is inadequate. Therefore, a key demand in high dimensional data handling is to cluster
the data according to user requirements. Several data mining techniques have been developed to address
the major issues in high dimensional data clustering.

Locality sensitive hashing (LSH) techniques were designed in [1] to address near-neighbor search
issues for high-dimensional data. However, the true positive rate was not improved using LSH techniques.
An incremental semi supervised clustering ensemble approach (ISSCE) was introduced in [2], which takes
advantage of the random subspace technique and the constraint propagation approach to perform high
dimensional data clustering. However, the data retrieval process was not carried out in an efficient manner.

A discriminative embedded clustering framework was introduced in [3] for clustering high
dimensional data by joining subspace learning and clustering. Though the formulated nonconvex
optimization issues were addressed, the framework was not suitable for supervised cases. A stratified
sampling method was presented in [4] for generating subspace component datasets. However, the space
complexity was not reduced using the stratified sampling method. A new ranking-based hashing framework
was introduced in [5] that maps data from various modalities into Hamming space, where cross-modal
similarity is calculated by Hamming distance. Though the space complexity was reduced, the data retrieval
process was complicated. A new fuzzy c-means (FCM) model with sparse regularization was introduced in
[6] by reformulating the FCM objective function into a weighted between-cluster sum of squares form and
imposing sparse regularization on the weights. However, data retrieval time was not reduced using the FCM
model. The Interesting Subspace Clustering (ISC) algorithm presented in [7] utilized the attribute
dependency measure from Rough Set theory to recognize the subspaces. However, it failed to handle the
problem of densely populated data points. Model-based clustering latent trait (MCLT) models with block
effect were introduced in [8] as a suitable alternative for sampled data. The MCLT model did not consider
space and time complexity during the clustering process.

Predictive Subspace Clustering (PSC) was introduced in [9] for clustering high-dimensional
data. However, PSC is not suitable for clustering densely populated high dimensional data points. An
efficient high-dimensional indexing library called HDIdx was introduced in [10] for approximate nearest
neighbor (NN) search. It transforms the input high-dimensional vectors into compact binary codes in an
efficient and scalable manner for NN search with lower space complexity. Though space complexity was
reduced, data retrieval was not carried out in an efficient manner. A Mahalanobis distance based local
distribution oriented spectral clustering technique was developed in [11] to group the data in dimensional
space. However, data retrieval was not carried out. To overcome the above mentioned issues, such as low
true positive rate, high space and time complexity during clustering, lack of data retrieval and inability to
handle densely populated data points, the Spectral Clustering based Vantage Point Tree Indexing (SC-VPTI)
Technique is introduced. The SC-VPTI technique is designed for efficient data retrieval based on the user
query with minimum time.

The contributions of our research work are as follows. A Spectral Clustering Based VP Tree
Indexing (SC-VPTI) Technique clusters and indexes densely populated high dimensional data points for
efficient data retrieval based on the user query. The SC-VPTI technique makes three major contributions.
First, a Normalized Spectral Clustering Algorithm clusters similar high dimensional data points based
on the similarity score of the data points. Second, a Vantage Point Tree indexes the clustered high
dimensional data points for efficient data retrieval. The indexed data points are represented by a circle,
and VP indexing reduces the space complexity of storing multiple high dimensional data points. Third, the
indexed similar data points are retrieved from the indexing tree based on the user query with the help of
the Vantage Point Tree based Data Retrieval Algorithm. As a result, the SC-VPTI technique achieves a higher
true positive rate with minimum data retrieval time.

The rest of the paper is organized as follows. In Section 2, the proposed SC-VPTI technique is described
with the help of a structural diagram. In Section 3, the experimental setting is discussed, and result
analysis is carried out with tables and graphs in Section 4. Different clustering techniques for high
dimensional data are reviewed in Section 5. Section 6 concludes the presented work.


2. SPECTRAL CLUSTERING BASED VP TREE INDEXING TECHNIQUE 

The Spectral Clustering Based VP Tree Indexing (SC-VPTI) Technique is introduced to cluster and
index densely populated high dimensional data points for effective data retrieval based on the user query.
It clusters the dense data points and increases the data retrieval rate in three stages. First, the SC-VPTI
technique applies the Normalized Spectral Clustering Algorithm to group similar high dimensional data
objects. Then, it constructs a Vantage Point tree for indexing the clustered data points to form the
indexing database with minimum space complexity. Finally, it uses the Vantage Point tree based data
retrieval algorithm to extract the user requested data from the indexing database with lower data
retrieval time. The overall structural design of the SC-VPTI Technique for clustering densely
populated high dimensional data points is described in Figure 1.

As shown in Figure 1, the SC-VPTI Technique initially takes as input the El Nino weather dataset,
which comprises a collection of densely populated high dimensional data points. Then, the SC-VPTI
Technique applies the normalized spectral clustering algorithm for clustering the data points from the
high dimensional database. Next, it constructs a Vantage Point tree for indexing the clustered high
dimensional data points with minimum space complexity. Finally, the SC-VPTI technique performs the
Vantage Point Tree based data retrieval process to retrieve the user requested data with lower data
retrieval time. Normalized spectral clustering and vantage point tree indexing are described in the
following sections.


[Figure 1: El Nino weather data set; clustering (groups the similar data points based on data type); indexing of clustered data points; effective data retrieval based on user query using Vantage Point Tree]

Figure 1. Overall structural design of spectral clustering based VP tree indexing technique


2.1. Normalized spectral clustering algorithm 

In the SC-VPTI technique, normalized spectral clustering uses the spectrum (eigenvalues) of the
similarity matrix of the high dimensional data. The similarity matrix is given as input and comprises a
quantitative estimate of the relative similarity of each pair of high dimensional data points in the dataset.
Spectral clustering forms a pairwise similarity matrix ‘A’, computes the Laplacian matrix ‘L’ and then
computes the eigenvectors of ‘L’. The eigenvector of the normalized graph Laplacian is a relaxation of the
binary indicator vector that minimizes the normalized cut on the graph. The Normalized Spectral Clustering
process for grouping similar data points is shown in Figure 2.


[Figure 2: El Nino weather dataset; construct similarity matrix; construct Laplacian matrix; determine normalized Laplacian matrix; compute first K eigenvectors; employ k-means to cluster data points; grouping of data points based on data type]


Figure 2. Process of normalized spectral clustering for grouping similar data points


Figure 2 describes the normalized spectral clustering process for grouping similar data
points based on the data type. Initially, the similarity matrix is constructed, and then the Laplacian
matrix is built from the similarity scores obtained from the data points. Next, the normalized spectral
clustering matrix is constructed by renormalizing the row values of the previously constructed matrix.
The k-means algorithm is then applied to the normalized matrix to cluster the data points. Finally,
similar high dimensional data points are grouped to form k clusters, such as sea surface temperatures,
relative humidity, rainfall, subsurface temperatures and air temperature data.

In the SC-VPTI technique, let $\{X_i\}$ be the set of data points, where $i = 1, 2, 3, \ldots, n$,
in the densely populated high dimensional data (i.e., the El Nino weather dataset). The densely populated
high dimensional dataset is represented by an undirected graph $G(V, E)$, where $V$ denotes the set of
vertices (i.e., data points) and $E$ indicates the edge relationship between a pair of data points.
Initially, the similarity matrix is described as the symmetric matrix ‘A’. The degree of the $i$-th data
point in the high dimensional dataset is mathematically formulated as,


$d_i = \sum_{j=1}^{n} A_{ij}$ (1)


From Equation (1), $A_{ij}$ denotes the similarity between two data points $X_i$ and $X_j$ from the densely
populated high dimensional data, and $d_i$ denotes the degree of the $i$-th data point. In the spectral
clustering process, the pair-wise similarity is computed with the help of a similarity function. The
Gaussian kernel function is one of the most commonly used similarity functions. The similarity score
between two data points $X_i$ and $X_j$ is measured based on the type of data with the Gaussian kernel
function, which is formulated as,


$A_{ij} = \exp\left(-\frac{\lVert X_i - X_j \rVert^2}{2\sigma^2}\right)$ if $i \neq j$, and $A_{ii} = 0$ (2)


From (2), the similarity matrix is constructed for each pair of data points, in which $\lVert X_i - X_j \rVert$
represents the Euclidean distance between two data points $X_i$ and $X_j$. Here, the parameter $\sigma$
controls the width of the neighborhood. The diagonal matrix ‘D’ is defined as the matrix with the degrees
$d_1, d_2, d_3, \ldots, d_{n-1}, d_n$ on the diagonal. In the diagonal matrix, the $(i,i)$-th element is the
sum of the entries of A in the $i$-th row. The diagonal matrix is given by,


$D = \begin{pmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_n \end{pmatrix}$ (3)


From (3), the diagonal matrix is obtained. After obtaining the diagonal matrix, the unnormalised Laplacian 
matrix is constructed with data points and given by, 


$L = D - A$ (4)
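
As an illustration of Equations (1) to (4), the following minimal Python sketch builds the similarity matrix, the degrees and the unnormalised Laplacian for three toy points. The points, the value of sigma and the function names are hypothetical and are not taken from the SC-VPTI implementation. The kernel denominator 2σ² is the common convention and is assumed here.

```python
import math


def gaussian_affinity(points, sigma):
    """Eq. (2): A[i][j] = exp(-||Xi - Xj||^2 / (2 sigma^2)) for i != j, 0 on the diagonal."""
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                A[i][j] = math.exp(-sq / (2.0 * sigma ** 2))
    return A


def degrees(A):
    """Eq. (1): the degree d_i is the sum of row i of A."""
    return [sum(row) for row in A]


def laplacian(A):
    """Eq. (4): L = D - A, with D the diagonal matrix of degrees from Eq. (3)."""
    d = degrees(A)
    n = len(A)
    return [[(d[i] if i == j else 0.0) - A[i][j] for j in range(n)] for i in range(n)]


# Hypothetical toy data: two nearby points and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0)]
A = gaussian_affinity(points, sigma=1.0)
L = laplacian(A)
```

Each row of L sums to zero, because the diagonal entry d_i cancels the off-diagonal similarities in that row.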
From (4), ‘L’ is the Laplacian matrix, ‘D’ represents the diagonal matrix and ‘A’ denotes the similarity
matrix. Then, the first ‘K’ eigenvalues of the Laplacian matrix and their corresponding eigenvectors
$(v_1, v_2, v_3, \ldots, v_K)$ are determined, and the matrix ‘Z’ is constructed with these eigenvectors as
columns,


$Z = [v_1, v_2, v_3, \ldots, v_K]$ (5)


From (5), the matrix ‘Z’ is constructed. Then, the normalized Laplacian matrix ‘Y’ is constructed by
renormalizing each row of the ‘Z’ matrix. The normalized Laplacian matrix is constructed by,


$Y_{ij} = \frac{Z_{ij}}{\left(\sum_{j} Z_{ij}^2\right)^{1/2}}$ (6)


From (6), each row of ‘Y’ is treated as a point, and the rows are grouped into K clusters by using the
k-means clustering algorithm. The k-means algorithm minimizes the within-cluster sum of squares,


$\arg\min_{C} \sum_{a=1}^{K} \sum_{X_i \in C_a} \lVert X_i - \mu_a \rVert^2$ (7)

From (7), K clusters are formed, where $C_a$ is the $a$-th cluster, $\mu_a$ represents the cluster mean and
$X_i$ symbolizes the data points. The algorithmic process of the normalized spectral clustering algorithm is
given below,


Algorithm 1. Normalized Spectral Clustering Algorithm
// Normalized Spectral Clustering Algorithm
Input: Set of data points ‘{X_i}’, cluster number K
Output: Grouping of data points in different clusters
Step 1: Begin
Step 2: For each data point in El Nino weather dataset
Step 3:     Construct similarity matrix ‘A’ using (2)
Step 4:     Determine Laplacian matrix ‘L’ using (4)
Step 5:     Identify first K eigenvectors of L and denote them as Z using (5)
Step 6:     Compute normalized Laplacian matrix Y using (6)
Step 7:     Use k-means to group the rows of Y into K clusters using (7)
Step 8:     Assign data point X_i to cluster DPC_a if and only if row i of the matrix Y was
            assigned to cluster DPC_a
Step 9: End for
Step 10: Return clustering results of data points
Step 11: End


Algorithm 1 describes the normalized spectral clustering process. By constructing the
similarity matrix and the Laplacian matrix, similar data points are identified. Then, the normalized
Laplacian matrix is constructed and the first K eigenvectors are identified. After that, the k-means
algorithm is employed to group similar data points into K clusters. Thus, the data points in the densely
populated high dimensional data are grouped into clusters based on the data type.
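
The steps of Algorithm 1 can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' Java implementation: the function name, the toy points, the value of sigma, the symmetric normalisation D^{-1/2} L D^{-1/2} and the farthest-point k-means initialisation are all choices made here for the example. The sketch takes the eigenvectors of the K smallest eigenvalues, which is the usual convention for a Laplacian of the form L = D - A; this is equivalent to taking the K largest eigenvectors of D^{-1/2} A D^{-1/2}.

```python
import numpy as np


def normalized_spectral_clustering(X, k, sigma=1.0, n_iter=100):
    """Sketch of Algorithm 1: returns one cluster label per row of X."""
    X = np.asarray(X, dtype=float)

    # Eq. (2): Gaussian similarity with a zero diagonal (2*sigma^2 assumed).
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    A = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)

    # Eq. (1) and Eq. (4): degrees and unnormalised Laplacian L = D - A.
    d = A.sum(axis=1)
    L = np.diag(d) - A

    # Symmetric normalisation D^{-1/2} L D^{-1/2} (assumed variant).
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]

    # Eq. (5): eigh sorts eigenvalues in ascending order, so the first k
    # columns are the eigenvectors of the k smallest eigenvalues.
    _, eigvecs = np.linalg.eigh(L_sym)
    Z = eigvecs[:, :k]

    # Eq. (6): rescale every row of Z to unit length.
    Y = Z / np.linalg.norm(Z, axis=1, keepdims=True)

    # Eq. (7): Lloyd's k-means on the rows of Y, farthest-point initialisation.
    centers = [Y[0]]
    for _ in range(1, k):
        gaps = np.min([np.sum((Y - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(gaps)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = np.sum((Y[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = np.argmin(dist, axis=1)
        updated = np.array([
            Y[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels


# Hypothetical toy data: two well separated groups of three points each.
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]]
labels = normalized_spectral_clustering(points, k=2)
```

On this toy input the first three points receive one label and the last three the other. The dense n x n similarity matrix limits the sketch to small inputs.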


2.2. Vantage point tree for indexing clustered high dimensional data 

In the SC-VPTI technique, a vantage point tree is used for indexing the clustered high dimensional data.
Initially, the SC-VPTI technique uses the normalized spectral clustering algorithm to cluster data points
such as sea surface temperatures, relative humidity, rainfall, subsurface temperatures and air temperature
data. After clustering the data points, the indexing process is carried out by the Vantage Point Tree to
reduce the space complexity. The process of VP-tree indexing of high dimensional data is described in
Figure 3.


[Figure 3: El Nino weather dataset; spectral clustering; VP-tree indexing]


Figure 3. Process of VP-Tree Indexing High Dimensional Data 


Figure 3 shows the process of VP-tree indexing of high dimensional data. The SC-VPTI technique
uses a VP-tree for indexing the clustered data points. In the VP-tree, the region that stores clustered data
points is denoted by a circle. Each node of the VP tree consists of an input point and a radius. All the
left children of a given node lie inside the circle and all the right children lie outside the circle. The
tree itself does not need to know anything about what is stored; it only needs a distance function that
satisfies the metric space properties.

Let us consider that the El Nino weather dataset is clustered into k clusters, DPC, consisting of N data
points. For each node in the tree, a clustered data point is chosen to be the vantage point by Vantage
Point Selection. Let VP be the clustered data point chosen for the root node and $\mu$ be the median of the
distance values of all the other clustered data points in DPC with respect to VP. DPC is partitioned into
two subsets of approximately equal sizes, $DPC_l$ and $DPC_r$, given by,


$DPC_l = \{ dpc \in DPC \mid d(dpc, VP) < \mu \}$ (8)
$DPC_r = \{ dpc \in DPC \mid d(dpc, VP) \geq \mu \}$ (9)


From (8) and (9), d(dpc, VP) symbolizes the distance between a clustered data point ‘dpc’ and VP. Each
subset is linked to one node of the VP-tree. For each node, a vantage point is chosen to store the clustered
data points in the resultant subset. The VP-tree stores many data points at one leaf node. Finally, the
whole set of clustered data points is organized as a balanced tree.

The VP-tree structure is simple: each node has the form (VP, M, Rptr, Lptr). ‘VP’ symbolizes the
vantage point and M denotes the median distance among all data points indexed below that node, whereas
Rptr and Lptr are pointers to the right and left branches respectively. The left branch of a node indexes
clustered data points whose distances from VP are less than or equal to M. Consequently, the right branch
of the node indexes the clustered data points whose distances from VP are greater than or equal to M. In
leaf nodes, rather than pointers to left and right branches, references to clustered data points are kept.
The distance between the vantage point VP and a clustered data point $DPC_i$, from which the median
distance is obtained, is determined by,


$d(VP, DPC_i) = \lVert VP - DPC_i \rVert$ (10)


From (10), the median distance is measured. Given a data set of k clustered data
points $DPC = \{DPC_1, DPC_2, \ldots, DPC_k\}$ and a distance function $d(VP, DPC_i)$, a VP tree is
constructed by using the following algorithmic process,


Algorithm 2. VP Tree based Clustered Data Point Indexing Algorithm
// VP tree based Clustered Data Point Indexing Algorithm
Input: k clustered data points ‘DPC = {DPC_1, DPC_2, ..., DPC_k}’
Output: Create VP tree for indexing of clustered data points
Step 1: Begin
Step 2: if |DPC| = 0, then construct an empty tree
Step 3: Randomly select vantage point ‘VP’
Step 4: For each clustered data point ‘DPC_i’, calculate the distance d(VP, DPC_i) from vantage
        point ‘VP’ to the data point using (10)
Step 5: Compute mean and variance of the distances
Step 6: M = median of { d(VP, DPC_i) | DPC_i ∈ DPC }
Step 7: For each clustered data point ‘DPC_i’
Step 8:     if d(VP, DPC_i) < M, then
Step 9:         Clustered data point ‘DPC_i’ is stored in left branch of tree
Step 10:    else
Step 11:        Clustered data point ‘DPC_i’ is stored in right branch of tree
Step 12:    end if
Step 13: End for
Step 14: End


By using Algorithm 2, clustered data points are efficiently stored in the VP tree structure
based on data type. VP tree indexing minimizes the overlap space and optimizes the retrieval path of the
index. This in turn helps to reduce the space complexity.
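
The construction in Algorithm 2 can be sketched recursively in Python as shown below. This is a minimal illustration, not the authors' implementation: the names VPNode, build_vp_tree and collect, the toy centroids and the fixed random seed are hypothetical, the Euclidean distance is assumed as the metric, and each node stores a single point rather than several points per leaf. The mean and variance of Step 5 are not used in this simplified version.

```python
import math
import random
import statistics
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class VPNode:
    vp: Point                         # vantage point VP
    m: float = 0.0                    # median distance M
    left: Optional["VPNode"] = None   # Lptr: points with d(VP, x) < M
    right: Optional["VPNode"] = None  # Rptr: points with d(VP, x) >= M


def build_vp_tree(points: List[Point], rng: random.Random) -> Optional[VPNode]:
    if not points:
        return None                   # empty input gives an empty tree
    rest = list(points)
    vp = rest.pop(rng.randrange(len(rest)))   # randomly selected vantage point
    node = VPNode(vp=vp)
    if rest:
        dists = [math.dist(vp, p) for p in rest]
        node.m = statistics.median(dists)
        # Split as in Eq. (8) and Eq. (9): inside the circle goes left, the rest right.
        node.left = build_vp_tree([p for p, d in zip(rest, dists) if d < node.m], rng)
        node.right = build_vp_tree([p for p, d in zip(rest, dists) if d >= node.m], rng)
    return node


def collect(node: Optional[VPNode]) -> List[Point]:
    """List every point indexed at or below a node."""
    if node is None:
        return []
    return [node.vp] + collect(node.left) + collect(node.right)


# Hypothetical cluster representatives to be indexed.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (5.0, 1.0), (4.0, 4.0), (2.0, 5.0)]
root = build_vp_tree(centroids, random.Random(0))
```

Because the vantage point is removed from the set before splitting, every recursive call works on a strictly smaller list, so the construction always terminates.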


2.3. VP-tree based data retrieval process

After indexing the clustered data points, the SC-VPTI technique performs the VP-tree based data
retrieval process for efficient data retrieval based on the user query. Data retrieval is the process of
retrieving the relevant data from the indexed database based on the user request. To retrieve the data, the
user query is given as input. Then, the queried data are searched and retrieved. Finally, the retrieved data
are transmitted to the corresponding user. The data retrieval process is shown in Figure 4.


[Figure 4: Data Point Indexing]


Figure 4. Data retrieval process


Figure 4 shows the block diagram of the data retrieval process. For a given user query ‘Q’, the set of
data points that are within the distance ‘r’ of Q is retrieved by the search algorithm. The algorithmic
process of the VP-Tree Based Data Retrieval Algorithm is explained below.


Algorithm 3. VP-Tree Based Data Retrieval Algorithm
// VP-Tree Based Data Retrieval Algorithm
Input: User queries Q_p = Q_1, Q_2, Q_3, ..., Q_q, query range ‘r’, vantage point ‘VP’, and median distance ‘M’
Output: Improved true positive rate of data retrieval and reduced data retrieval time
Step 1: Begin
Step 2: For each user query ‘Q_p’
Step 3:     if d(Q_p, VP) < r, then retrieve the vantage point at the root
Step 4:     if d(Q_p, VP) + r ≥ M, then
Step 5:         Search right branch of tree
Step 6:     End if
Step 7:     if d(Q_p, VP) − r < M, then
Step 8:         Search left branch of tree
Step 9:     End if
Step 10:    if both search conditions are satisfied, then
Step 11:        Both branches of the tree are searched for retrieving user queried data points
Step 12:    End if
Step 13:    Display retrieved data points to user
Step 14: End for
Step 15: End


By using Algorithm 3, the SC-VPTI technique efficiently retrieves data points from the VP tree
indexing database based on the user query. As a result, the SC-VPTI technique increases the true positive
rate of data retrieval and reduces data retrieval time.
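
The range query of Algorithm 3 can be sketched in Python as shown below. This is an illustrative sketch, not the authors' implementation: the function names, the dictionary-based node layout, the grid of toy points, the query and the radius are hypothetical, and the Euclidean distance is assumed as the metric. The sketch reports a vantage point when its distance to the query is at most r, and it prunes a branch only when the triangle inequality rules out any match in it.

```python
import math
import random
import statistics


def build(points, rng):
    """Compact VP-tree builder: each node is a dict with keys vp, m, left, right."""
    if not points:
        return None
    rest = list(points)
    vp = rest.pop(rng.randrange(len(rest)))
    if not rest:
        return {"vp": vp, "m": 0.0, "left": None, "right": None}
    dists = [math.dist(vp, p) for p in rest]
    m = statistics.median(dists)
    return {
        "vp": vp,
        "m": m,
        "left": build([p for p, d in zip(rest, dists) if d < m], rng),
        "right": build([p for p, d in zip(rest, dists) if d >= m], rng),
    }


def range_search(node, query, r, found):
    """Append to `found` every indexed point within distance r of the query."""
    if node is None:
        return found
    d = math.dist(query, node["vp"])
    if d <= r:
        found.append(node["vp"])          # the vantage point itself matches
    if d - r < node["m"]:
        range_search(node["left"], query, r, found)    # query ball reaches inside the circle
    if d + r >= node["m"]:
        range_search(node["right"], query, r, found)   # query ball reaches outside the circle
    return found


# Hypothetical indexed data: a 6 x 6 grid of points.
points = [(float(x), float(y)) for x in range(6) for y in range(6)]
tree = build(points, random.Random(1))
query, radius = (2.2, 3.1), 1.5
hits = range_search(tree, query, radius, [])
```

When the query ball straddles the median circle, both conditions hold and both branches are visited, which corresponds to Steps 10 and 11 of Algorithm 3.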


3. EXPERIMENTAL SETTING 

The Spectral Clustering Based VP Tree Indexing (SC-VPTI) Technique is implemented in the Java
language using the El Nino dataset from the UCI machine learning repository. The El Nino dataset comprises
oceanographic and surface meteorological readings from a series of buoys positioned throughout the
equatorial Pacific. The data are expected to assist in the understanding and prediction of El Nino/Southern
Oscillation (ENSO) cycles. The dataset is spatio-temporal, and its attributes are both real and integer
valued. The dataset has 178080 instances and 12 attributes. The El Nino dataset includes attributes such as
date, latitude, longitude, zonal winds (west<0, east>0), meridional winds (south<0, north>0),
relative humidity, air temperature, sea surface temperature and subsurface temperatures down to a depth of
500 meters.


4. RESULTS AND DISCUSSIONS 

The SC-VPTI technique is compared against two existing approaches, namely
Locality-Sensitive Hashing (LSH) [1] and the incremental semi supervised clustering ensemble (ISSCE) [2].
The performance of the SC-VPTI technique is evaluated on factors such as space complexity, data retrieval
time and true positive rate with the help of tables and graphs.


4.1. Space complexity 

Space complexity is defined as the amount of memory space required for clustering and indexing
the densely populated high dimensional data. The space complexity is measured in
megabytes (MB) and formulated as,


Space complexity = n * memory for storing one clustered object (11)


From (11), ‘n’ denotes the number of data points taken for the clustering process. The lower the space
complexity, the more efficient the technique.

Table 1 describes the space complexity values obtained for different numbers of data points
in the range of 50-500. From the table values, the proposed SC-VPTI technique has lower space complexity
during clustering and indexing of the densely populated high dimensional data points when compared to the
LSH Technique and the ISSCE Approach. Besides, when the number of data points in the clustering and
indexing process increases, the space complexity also increases in all three methods.


Table 1. Tabulation for Space Complexity

                          Space Complexity (MB)
Number of Data Points     LSH Technique     ISSCE Approach     SC-VPTI technique
50                        26.36             23.78              14.32
100                       28.12             25.14              16.34
150                       29.89             27.96              17.98
200                       31.78             29.17              19.23
250                       33.98             31.54              21.59
300                       35.63             33.98              23.87
350                       37.89             34.52              25.98
400                       39.27             37.12              27.45
450                       41.96             38.33              29.75
500                       42.34             40.15              31.47


However, the space complexity of the proposed SC-VPTI technique is lower. This is because of the
application of the normalized spectral clustering algorithm and the VP based Clustered Data Point Indexing
Algorithm in the SC-VPTI technique, which efficiently group and index the high dimensional data. In the
normalized spectral clustering algorithm, the similarity matrix and Laplacian matrix are constructed to
identify similar data points. Then, the k-means algorithm is applied to group the similar data points into
K clusters. By applying the indexing algorithm, the clustered data points are indexed in the left and right
branches according to their distance from the vantage point. In the VP tree, the left branch of a node
indexes clustered data points whose distances from the vantage point are less than or equal to the median
distance. Accordingly, the right branch of the node indexes the clustered data points whose distances from
the vantage point are greater than or equal to the median distance. Based on the indexing algorithm, the
densely populated clustered data are stored in an efficient manner with lower space complexity. As a
result, the proposed SC-VPTI technique reduces the space complexity of densely populated high dimensional
data by 35% as compared to the LSH Technique [1] and by 30% as compared to the ISSCE Approach [2].
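
The reported percentages can be reproduced from the values in Table 1. The short Python sketch below assumes that each percentage is the mean of the per-row relative reductions; the paper does not state the averaging method, so this is one plausible reading, and the function name is hypothetical.

```python
# Space complexity values (MB) copied from Table 1, for 50 to 500 data points.
lsh = [26.36, 28.12, 29.89, 31.78, 33.98, 35.63, 37.89, 39.27, 41.96, 42.34]
issce = [23.78, 25.14, 27.96, 29.17, 31.54, 33.98, 34.52, 37.12, 38.33, 40.15]
sc_vpti = [14.32, 16.34, 17.98, 19.23, 21.59, 23.87, 25.98, 27.45, 29.75, 31.47]


def mean_reduction_pct(baseline, proposed):
    """Average, over the rows, of (baseline - proposed) / baseline, as a percentage."""
    ratios = [(b - p) / b for b, p in zip(baseline, proposed)]
    return 100.0 * sum(ratios) / len(ratios)


vs_lsh = mean_reduction_pct(lsh, sc_vpti)
vs_issce = mean_reduction_pct(issce, sc_vpti)
print(round(vs_lsh), round(vs_issce))  # prints: 35 30
```

Under this reading the table is consistent with the stated 35% and 30% reductions. The relative saving shrinks as the number of data points grows, from roughly 46% at 50 points to roughly 26% at 500 points against LSH.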


4.2. True positive rate 


True positive rate (TPR) of data retrieval is described as the ratio of the number of correctly
retrieved data points based on the user query to the total number of data points. The true positive rate of
data retrieval is measured as a percentage (%) and formulated as,


$TPR = \frac{\text{number of correctly retrieved data points based on user query}}{\text{total number of data points}} \times 100$ (12)


The higher the true positive rate, the more efficient the technique.
Figure 5 portrays the true positive rate for densely populated high dimensional data versus the
number of data points in the range of 50-500. From the figure, the proposed SC-VPTI technique has a higher
true positive rate when retrieving data points based on the user query from the indexing database when
compared to the LSH Technique and the ISSCE Approach. In addition, when the number of data points in the
clustering and indexing process increases, the true positive rate also increases in all three methods.
However, the true positive rate of the proposed SC-VPTI technique is higher. This is because of the
application of the Vantage Point based Clustered Data Point Indexing Algorithm and the Vantage Point based
Data Retrieval Algorithm in the SC-VPTI technique, which efficiently search and retrieve the exact user
requested data.


[Figure 5: True Positive Rate (%) versus Number of Data Points for LSH Technique, ISSCE Approach and SC-VPTI technique]


Figure 5. Measurement of True Positive rate 


The vantage point tree is constructed for indexing the clustered data points, which are stored in the
left and right branches of the tree. After that, data retrieval is carried out by retrieving the similar
data from the indexed database based on the user query. To retrieve the data points, both branches of the
vantage point tree are searched and the data points are displayed to users according to their requirements.
This helps to correctly retrieve the similar data points and achieve a high true positive rate in an
efficient way. As a result, the proposed SC-VPTI technique increases the true positive rate of densely
populated high dimensional data by 22% as compared to the LSH Technique [1] and by 12% as compared to the
ISSCE Approach [2].


4.3. Data retrieval time 
Data retrieval time is defined as the amount of time taken for retrieving the data points from the
indexing database. It is measured in milliseconds (ms) and formulated as,


Data Retrieval Time = n * time for retrieving one data point (13)


From (13), ‘n’ represents the number of data points. The lower the data retrieval time, the more efficient
the method.

Figure 6 describes the data retrieval time for densely populated high dimensional data
versus the number of data points in the range of 50-500. From the figure, the proposed SC-VPTI technique
consumes less time when retrieving data points based on the user query from the indexing database when
compared to the LSH Technique and the ISSCE Approach. In addition, when the number of data points in the
clustering and indexing process increases, the data retrieval time also increases in all three methods.
However, the data retrieval time of the proposed SC-VPTI technique is lower. This is due to the Vantage
Point based Clustered Data Point Indexing Algorithm and the Retrieval Algorithm in the SC-VPTI technique,
which efficiently search and retrieve data in minimal time.

The indexing algorithm stores the data in two different branches, namely left and right,
which are denoted as circles. This helps to effectively store the clustered high dimensional data in the
two branches of a node. After indexing the data, data retrieval from the index database is carried out using
the VP-Tree Based Data Retrieval Algorithm. For each user query, the similar data points are searched in the
indexing database. This helps to reduce the data retrieval time of densely populated high dimensional data.
As a result, the proposed SC-VPTI technique reduces the data retrieval time of densely populated high
dimensional data by 17% as compared to the LSH Technique [1] and by 30% as compared to the ISSCE
Approach [2].


[Figure 6: Data Retrieval Time (ms) versus Number of Data Points for LSH Technique, ISSCE Approach and SC-VPTI technique; vertical axis ticks from 0 to 70]


Figure 6. Measurement of Data Retrieval Time 


5. RELATED WORKS 

A surprisingly simple method was introduced in [12] for addressing approximate nearest neighbor
(ANN) issues with high accuracy and a lower number of random I/O operations. However, although its binary
index structure reduces the space, it failed to consider the true positive rate in the process of data
retrieval. A new semi-supervised hashing method was introduced in [13] with pairwise supervised
information comprising must-link and cannot-link constraints. The designed method increased the
information provided by every bit using both the labeled data and the unlabeled data. A clustering
algorithm called SUBSCALE was introduced in [14] to identify the non-trivial subspace clusters at lower
cost, and it needed only k database scans for a k-dimensional dataset.

A new penalized forward selection technique in [15] reduced high dimensional optimization
issues to many one dimensional optimization issues by selecting the best predictor. However, the data
retrieval time was not reduced using the penalized forward selection technique. The Constraint-Partitioning
K-Means (COP-KMEANS) clustering algorithm was introduced in [16] for clustering high dimensional data and
minimizing the cost by removing the noisy dimensions. Predictive Subspace Clustering (PSC) was
introduced in [17] for clustering high-dimensional data. However, PSC is not suitable for densely populated
high dimensional data points.

Discriminative Embedded Clustering (DEC), which combines subspace learning and clustering, was
carried out in [18]. However, DEC consumed a large amount of time for data retrieval. The H-K clustering
algorithm was designed in [19] to minimize the space complexity during high dimensional data clustering.
A Hierarchical Accumulative Clustering Algorithm was introduced in [20] to cluster high dimensional data
with higher clustering accuracy. However, the designed algorithm needs a large amount of memory space. A
robust multi objective subspace clustering (MOSCL) algorithm was presented in [21] for high-dimensional
clustering with higher accuracy of subspace clustering. However, the space complexity remained unaddressed
using the MOSCL algorithm. Graph-based clustering was developed in [22] to cluster web search results with
high clustering quality. However, clustering of densely populated high dimensional data was not
performed. An incremental-clustering approach was developed in [23] for constructing clusters based on
selecting an optimal threshold value. However, efficient data retrieval was not performed with minimum time.


6. CONCLUSION 

An efficient Spectral Clustering Based VP Tree Indexing (SC-VPTI) Technique is developed to
enhance data retrieval performance based on the user query with lower space complexity and a higher true
positive rate. Existing locality sensitive hashing (LSH) techniques are employed for near-neighbor search
issues, but they fail to address the retrieval of high dimensional data. The incremental semi supervised
clustering ensemble approach does not consider the retrieval process. These problems are addressed by the
SC-VPTI Technique, which improves high dimensional data clustering in three processing steps. First, the
Normalized Spectral Clustering technique groups similar high dimensional data points into clusters based on
a similarity matrix, which comprises a quantitative similarity estimate for each pair of data points in the
dataset. After that, vantage point tree indexing is performed on the clustered data points, which are
stored in the left and right branches of the tree. This helps to reduce the space complexity. Finally, the
indexed data are retrieved based on the user query using the Vantage Point Tree based Data Retrieval
Algorithm. The efficiency of the SC-VPTI technique is evaluated against two existing methods in terms of
space complexity, true positive rate and data retrieval time. The experimental results show that the
SC-VPTI technique provides better performance, with a higher true positive rate of data retrieval, lower
retrieval time and lower space complexity when compared to state-of-the-art works.